
Prerequisites

On Debian/Ubuntu:
apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
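The build commands below are run from the repository root. A minimal sketch, assuming the sources live at github.com/ikawrakow/ik_llama.cpp:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp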

CMake flags

Pass these flags to the initial cmake -B build invocation (the configure step), not to cmake --build.
| Flag | Default | Description |
| --- | --- | --- |
| GGML_NATIVE | OFF | Optimize for the host CPU (-march=native). Turn off when cross-compiling. |
| GGML_CUDA | OFF | Build with CUDA support. Requires the NVIDIA CUDA Toolkit. Defaults to native CUDA architecture detection. |
| CMAKE_CUDA_ARCHITECTURES | auto | Target a specific GPU compute capability, e.g. 86 for RTX 30-series. |
| GGML_RPC | OFF | Build the RPC backend for distributed inference across machines. |
| GGML_IQK_FA_ALL_QUANTS | OFF | Enable all KV cache quantization types for Flash Attention (beyond the default f16, q8_0, q6_0, and bf16). |
| GGML_NCCL | ON | Enable NCCL for multi-GPU communication. Set to OFF to disable. |
| LLAMA_SERVER_SQLITE3 | OFF | Build SQLite3 support into llama-server (required for the mikupad web UI). |
CPU build example
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
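The CPU example above bakes in host-specific optimizations. For a binary meant to run on a different machine, the table above says to turn GGML_NATIVE off instead; a portable-build sketch:
cmake -B build -DGGML_NATIVE=OFF
cmake --build build --config Release -j$(nproc)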
CUDA build example
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j$(nproc)
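Configure-time flags from the table can be combined in a single invocation. A sketch that also enables the RPC backend and the extra Flash Attention KV cache quantization types (flag names as listed above; adjust to your hardware):
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_RPC=ON -DGGML_IQK_FA_ALL_QUANTS=ON
cmake --build build --config Release -j$(nproc)
At run time the extra cache types are selected with the server's KV cache type options (-ctk/-ctv in upstream llama.cpp), assuming this fork keeps those option names.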

Environment variables

Set these in the shell before invoking llama-server or any other tool.
| Variable | Description |
| --- | --- |
| CUDA_VISIBLE_DEVICES | Restrict which GPUs are visible. Example: CUDA_VISIBLE_DEVICES=0,2 uses the first and third GPU only. |
| GGML_CUDA_ENABLE_UNIFIED_MEMORY | Set to 1 to enable CUDA Unified Memory, allowing the GPU to access host RAM when VRAM is exhausted. Useful for large models on systems with limited VRAM. |

For example:
CUDA_VISIBLE_DEVICES=0,2 llama-server --model /models/model.gguf -ngl 999
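The unified-memory variable is set the same way. A sketch that lets a model spill from VRAM into host RAM, using the same placeholder model path as the example above:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server --model /models/model.gguf -ngl 999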
The only fully supported compute backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained.